Abstract. We present work in progress on detecting duplicated functionality
in logic programs. Eliminating duplicated functionality has recently become
prominent in the context of refactoring. We describe a quantitative approach
that allows us to measure the "similarity" between two predicate definitions.
Moreover, we show how to compute a so-called "fingerprint" for every
predicate. Fingerprints capture those characteristics of the predicate that
are significant when searching for duplicated functionality. Since reasoning
on fingerprints is much easier than reasoning on predicate definitions,
comparing fingerprints is a promising direction for automated detection of
code duplication in logic programs.



1 Introduction 

Refactoring [10] is a source-to-source program transformation that changes
program structure and organization, but not program functionality. The major
aim of refactoring is to improve readability, maintainability and
extensibility of the existing software. Refactoring has been shown to be
profitable both for developing new software and for maintaining existing
software. Refactoring [16] consists of a series of small transformation steps,
also known as refactorings. For each step, an appropriate code fragment and an
appropriate transformation have to be chosen, and the transformation has to be
executed and evaluated. In this paper
we restrict our attention to the first step, namely identifying potential for trans- 
formation application. A number of refactorings aim at eliminating duplicated 
code (or better: duplicated functionality) and therefore automatic detection of 
code duplication becomes a necessity. 

Code duplication can be caused by a number of reasons. First of all, it can 
result from unfamiliarity of the developer with the existing code body. Second, 
the "copy and paste" technique is commonly used when the existing functionality 
has to be slightly adapted. Although in this case one usually does not end up
literally duplicating the code, the changes introduced by adaptation are
usually relatively minor and a generalization of the original and the adapted
fragments can often be proposed. Finally, code duplication might result from a
polyvariant program analysis [17].

This holds for any programming paradigm, but we concentrate on logic
programming (LP). While code duplication detection for imperative and
object-oriented programming languages has often been studied in the past
[3,4,9,11,12,13,15,18], this topic has attracted less research attention in
the logic programming community. To the best of our knowledge the only results
on code duplication in LP are due to Vanhoof [21], motivated by the study of
refactoring techniques for logic programs [20].

In a logic programming setting, we say that two predicates are duplicates 
if their definitions are identical up to a consistent renaming of variables and a 
permutation of the arguments. Consider for example the predicates append/3 
and concat/3 depicted below: 

append([],L,L).
append([X|Xs],Y,[X|Zs]) :- append(Xs,Y,Zs).

concat(L,[],L).
concat([E|Zs],[E|Es],Y) :- concat(Zs,Es,Y).

Even though the above predicate definitions are not literal copies of one another, 
intuitively it is clear that they are meant to perform the same operation, i.e. list 
concatenation. In addition to supporting variable renaming and argument
permutation, we would like our notion of code duplication to be to some extent
independent of the order of the clauses and the order of the atoms (including
the unifications) in the clause bodies. As a rather trivial example, reconsider the
definition of the append/3 predicate, this time written in a kind of normal form, 
where the unifications have been moved from the head to the body: 

append(X,Y,Z) :- X = [], Z = Y.
append(X,Y,Z) :- X = [Xe|Xs], Z = [Xe|Zs], append(Xs,Y,Zs).

In the definition above, one could easily switch the order of the unifications 
while the resulting predicate could still be considered a duplicate of the
original append/3. This degree of liberty one has in organizing the code makes
approaches based on textual pattern-matching like [3,9] less suited for our
purposes. Moreover, unlike imperative programming languages with a
well-developed set of control keywords (if, while, repeat, switch, etc.),
control structures in Prolog are less explicit. This hinders the application
of textual pattern-matching approaches to logic programming.

As a second example, let us consider two predicates that are not duplicates 
but that are nevertheless similar in the sense that they contain some common 
functionality: 

rev_all([],[]).
rev_all([X|Xs],[Y|Ys]) :- reverse(X,Y), rev_all(Xs,Ys).

add1_and_sqr([],[]).
add1_and_sqr([X|Xs],[Y|Ys]) :- N is X + 1, Y is N*N, add1_and_sqr(Xs,Ys).



These definitions implement two different relations: rev_all reverses all the
elements of an input list, while add1_and_sqr transforms each element $x$ of an
input list into $(x + 1)^2$. They nevertheless have a common core and, if we
assume a language with higher-order capabilities (as for example in [5]), one
can extract or generalize the common functionality into a map/3-like predicate
and translate each call to rev_all/2 and add1_and_sqr/2 into an appropriate
call to map/3, providing the code specific to rev_all/2 or add1_and_sqr/2 as an
argument.

In [21] we have given a formal characterization of code duplication in the
sense outlined above. While the associated analysis, which basically tries to 
establish an isomorphism between two predicate definitions by comparing every 
possible pair of subgoals, can be used to search for duplication, its complexity 
renders it hard if not impossible to use in practice. Worse, the analysis is not 
quantitative: even though it may find some common functionality between two or 
more predicate definitions, it has no way of indicating how similar the definitions 
are. Yet, this is important if the analysis is to be used in a practical tool since 
not every pair of predicates that share some common functionality is susceptible 
to generalization. 

In this work, we revise the notion of code duplication for logic programs. In 
a first step, we formally define a quantitative measure that reflects the similarity 
between two predicate definitions. In contrast with earlier work [21], this allows
us not only to detect predicate definitions that are duplicates of one another, 
but it also provides us with a meaningful indication about how much code is 
common between two predicate definitions. In a second step, we show how to 
compute a so-called fingerprint for every predicate in the software system under 
consideration. Such a fingerprint captures in a single value those characteristics 
of a predicate that are significant when searching for duplicated (or common) 
functionality while it abstracts those characteristics that are much less relevant 
during the search. Our domain of fingerprint values is such that 1) duplicated 
predicates are mapped onto the same fingerprint, and 2) an order relation can be 
defined on fingerprint values that reflects the degree of "similarity" between the
corresponding predicates. Predicates whose fingerprints are "close" to one an- 
other in the order are likely to share a common structure and hence are potential 
candidates for generalization. 

2 Basic Definitions 

In what follows, we assume the reader to be familiar with the basic logic
programming concepts as they are found, for example, in [1,14]. As usual,
variable names will be represented by uppercase symbols X, Y, ..., whereas
predicate and function symbols will be represented by lowercase letters. Unless
noted otherwise, we will use p, q, r, ... to refer to predicate names and
f, g, h, ... to refer to function names.

We restrict ourselves to definite programs. In particular, we consider a
program to be defined as a set of clauses of the form
$H \leftarrow B_1 \wedge \ldots \wedge B_n$ with $H$ an atom and
$B_1 \wedge \ldots \wedge B_n$ a conjunction of atoms. In our examples we also
use the Prolog-style notation for clauses, i.e., we write :- instead of
$\leftarrow$ and , instead of $\wedge$. A goal is a conjunction of atoms. Given
a goal $B_1 \wedge \ldots \wedge B_n$, we write $\{B_1, \ldots, B_n\}$ to
represent the multiset of atoms occurring in it. A goal
$A_1 \wedge \ldots \wedge A_m$ is called a subgoal of a goal
$B_1 \wedge \ldots \wedge B_n$ if the multiset $\{A_1, \ldots, A_m\}$ is a
submultiset of the multiset $\{B_1, \ldots, B_n\}$. Given a particular clause
$c$, we denote by $head(c)$ and $body(c)$ the atom and the conjunction of atoms
that constitute the head and the body of the clause, respectively. For an atom
$A$, we denote by $pred(A)$ the predicate symbol used in $A$. Given a predicate
symbol $p/n$ we denote by $Clauses(p/n)$ the set of clauses $c$ such that
$pred(head(c))$ coincides with $p/n$.

Predicates can be mutually recursive. Therefore, rather than considering in- 
dividual predicate definitions, we will consider strongly connected components 
in the predicate dependency graph. Since strongly connected components can 
be seen as equivalence classes with respect to the "depends on" relation [2], the
strongly connected component (SCC) of a predicate p/n is denoted [p/n]. Given
a strongly connected component [p/n], we denote by Clauses([p/n]) the union
of the sets of clauses Clauses(q/m) for all $q/m \in [p/n]$. In what follows we will
often drop the arity from predicate symbols and we will write p (and likewise 
[p]) instead of p/n (and [p/n]). 

We will often need to refer to those atoms in the body of a clause that
represent a (direct or indirect) recursive call. Given a clause $c$, an atom
$B_i$ in $body(c)$ is called a recursive call if $pred(B_i)$ belongs to
$[pred(head(c))]$. Moreover, we will represent a clause $c \in Clauses([p])$ as

$A_0 \leftarrow Q_1 \wedge A_1 \wedge \ldots \wedge Q_k \wedge A_k \wedge Q_{k+1}$

where $A_i$ ($0 < i \leq k$) is a recursive call and $Q_i$ ($1 \leq i \leq k+1$)
is a (possibly empty) conjunction of atoms such that none of the conjuncts is a
recursive call.

A variable renaming is a bijective mapping from variables onto variables. For
any mapping $f : X \mapsto Y$, we denote by $f|_D$ the restriction of the
mapping to the domain $D \subseteq X$. The inverse of any mapping $f$ is
denoted by $f^{-1}$. We use the notation $\{x_1/y_1, \ldots, x_n/y_n\}$ to
explicitly represent a mapping $f : X \mapsto Y$ with
$dom(f) = \{x_1, \ldots, x_n\}$ and $y_i = f(x_i)$ for all $i$.

For any syntactic entity $E$ (be it a term, atom, goal or clause), we use
$vars(E)$ to denote the set of variables occurring in $E$. As usual, a
substitution is defined as a finite mapping from distinct variables to terms.
Substitutions are usually denoted by Greek letters such as
$\sigma, \theta, \ldots$ and for a syntactic entity $E$ and substitution
$\theta$ we denote by $E\theta$ the result of applying $\theta$ to $E$. Given
two syntactic entities $E_1$ and $E_2$, a generalization of $E_1$ and $E_2$ is
a syntactic entity $E$ such that there exist substitutions $\sigma_1$ and
$\sigma_2$ where $E\sigma_1 = E_1$ and $E\sigma_2 = E_2$. A most specific
generalization (msg) of $E_1$ and $E_2$ is a generalization $E$ of $E_1$ and
$E_2$ such that for any generalization $E'$ of $E_1$ and $E_2$ there exists a
substitution $\sigma$ such that $E = E'\sigma$. One can show the existence of a
unique most specific generalization (up to variable renaming).

We characterize the size of a term by counting the number of (internal)
nodes in the term's tree representation and the number of leaves corresponding
to constants. To that end, if we denote by $Term$ the set of terms, we define
the following mapping $nodes : Term \mapsto \mathbb{N}$:

$nodes(X) = 0$
$nodes(f(t_1, \ldots, t_n)) = 1 + \sum_{i=1}^{n} nodes(t_i)$

Since goals and clauses can be considered terms constructed by the $\wedge$ and
$\leftarrow$ functors, we will use the above measure to characterize the size
of goals and clauses as well. By extension, when considering a strongly
connected component $[p]$, we define
$nodes([p]) = \sum_{c \in Clauses([p])} nodes(c)$.
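As an illustration only, the nodes mapping can be sketched in Python. The term encoding used here (variables as plain strings, constants and compound terms as tuples headed by their functor) is an assumption of this sketch and not part of the formal development. Variables are taken to contribute 0, which is consistent with the node count of 6 reported for the msg in Example 2.

```python
# Assumed encoding (hypothetical, for illustration only):
#   - a variable is a plain string such as "X"
#   - a constant or compound term is a tuple (functor, arg1, ..., argn), n >= 0

def nodes(term):
    """Count functor and constant nodes of a term; variables contribute 0."""
    if isinstance(term, str):            # variable
        return 0
    _functor, *args = term
    return 1 + sum(nodes(arg) for arg in args)

# h(Z, a): the functor h and the constant a are counted, the variable Z is not.
h_term = ("h", "Z", ("a",))

# p(f(X), g(Y, h(Z, a)))
p_atom = ("p", ("f", "X"), ("g", "Y", h_term))

print(nodes(h_term), nodes(p_atom))
```

The same function applies unchanged to goals and clauses once the conjunction and the clause arrow are encoded as ordinary binary functors.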



3 Identifying duplication and similarity 
3.1 Comparing goals 

In what follows, we define a quantitative measure that represents the degree of 
similarity between two goals. As a starting point, let us define a measure that 
compares two goals in a purely syntactic way, by simply counting the number of 
common nodes in the goals' term representations. 

Definition 1. Given a pair of goals $Q_1 = A_1 \wedge \ldots \wedge A_n$ and
$Q_2 = A'_1 \wedge \ldots \wedge A'_n$, we define the strict commonality
between $Q_1$ and $Q_2$ as the natural number, denoted $c(Q_1, Q_2)$, which is
defined as follows:

$c((A_1 \wedge A_2 \wedge \ldots \wedge A_n), (A'_1 \wedge A'_2 \wedge \ldots \wedge A'_n)) = 1 + c(A_1, A'_1) + c(A_2 \wedge \ldots \wedge A_n, A'_2 \wedge \ldots \wedge A'_n)$
Example 1. Consider the following goals: $Q_1$ = p(f(X),g(Y,h(Z,a))), q(Z,X)
and $Q_2$ = p(f(T),g(T,h(Z,b))), q(Z,T). Then, the strict commonality
between them is $c(Q_1, Q_2) = 8$.

Note that for two goals $Q_1$ and $Q_2$, the strict commonality $c(Q_1, Q_2)$
quantifies the amount of structure that would be preserved upon taking the
most specific generalization of $Q_1$ and $Q_2$.

Lemma 1. Let $Q_1$ and $Q_2$ be as required in Definition 1. Then,
$c(Q_1, Q_2) = nodes(msg(Q_1, Q_2)) + \delta$ where $\delta$ represents the
number of occurrences of identical variables that appear in identical positions
in the tree representation of $Q_1$ and $Q_2$.

Proof. The proof is by induction on the structure of $Q_1$ and $Q_2$.



Example 2. For goals $Q_1$ and $Q_2$ in Example 1 we have

msg($Q_1$, $Q_2$) = p(f(_),g(_,h(Z,_))), q(Z,_)

where _ denotes an anonymous variable. Then, $nodes(msg(Q_1, Q_2)) = 6$. Since
$vars(Q_1) \cap vars(Q_2) = \{Z\}$ and Z appears twice in the msg, we have
$\delta = 2$. Indeed, $c(Q_1, Q_2) = 8 = 6 + 2 = nodes(msg(Q_1, Q_2)) + \delta$.
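The counting behind Example 1 and Lemma 1 can be checked mechanically. The Python sketch below uses a hypothetical encoding (variables as strings, other terms as tuples headed by their functor) and assumes that, beyond the conjunction case of Definition 1, two terms with the same functor and arity contribute 1 plus the commonality of their arguments, identical variables contribute 1, and any other pair contributes 0. These assumed cases reproduce the values quoted in the examples; the helper names c, delta and conj are ours.

```python
def is_var(term):
    """Variables are plain strings; everything else is a tuple (functor, *args)."""
    return isinstance(term, str)

def c(t1, t2):
    """Strict commonality of two terms compared position by position."""
    if is_var(t1) or is_var(t2):
        return 1 if t1 == t2 else 0          # assumed: identical variables count 1
    if t1[0] != t2[0] or len(t1) != len(t2):
        return 0                             # assumed: differing functors share nothing
    return 1 + sum(c(a, b) for a, b in zip(t1[1:], t2[1:]))

def delta(t1, t2):
    """Occurrences of identical variables at identical positions."""
    if is_var(t1) or is_var(t2):
        return 1 if t1 == t2 else 0
    if t1[0] != t2[0] or len(t1) != len(t2):
        return 0
    return sum(delta(a, b) for a, b in zip(t1[1:], t2[1:]))

def conj(*atoms):
    """Right-nested conjunction A1 and (A2 and ...) built from a binary functor."""
    if len(atoms) == 1:
        return atoms[0]
    return ("and", atoms[0], conj(*atoms[1:]))

# Goals of Example 1.
Q1 = conj(("p", ("f", "X"), ("g", "Y", ("h", "Z", ("a",)))), ("q", "Z", "X"))
Q2 = conj(("p", ("f", "T"), ("g", "T", ("h", "Z", ("b",)))), ("q", "Z", "T"))

print(c(Q1, Q2), delta(Q1, Q2))
```

Under these assumptions the difference c(Q1, Q2) - delta(Q1, Q2) is the number of nodes of the msg, as Lemma 1 states.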

Let us now extend the above definition in such a way that: 1) it deals with 
goals that do not have an equal number of atoms, 2) it takes commutativity of 
the $\wedge$ operator into account (that is, we want to consider commonality modulo
atom reordering), and 3) it abstracts from concrete variable names in the goals 
while retaining the sharing information. The resulting measure will reflect, for 
two arbitrary goals, the maximal amount of structure that can be preserved by 
generalizing a suitable reordering and renaming of both goals. 

When comparing two arbitrary goals, we will focus on those subgoals that 
are similarly structured, that is, subgoals that correspond to sets of atoms that 
basically represent calls to the same predicates. More formally: 

Definition 2. Let $Q_1$ and $Q_2$ be two goals. We say that $(Q'_1, Q'_2)$ is a
pair of similarly structured subgoals of $Q_1$ and $Q_2$ iff $Q'_1$ is a
subgoal of $Q_1$, $Q'_2$ is a subgoal of $Q_2$, and $\Pi(Q'_1) = \Pi(Q'_2)$
where $\Pi(Q)$ denotes the multiset of predicate symbols occurring in a goal $Q$.

Note that by definition, two similarly structured subgoals always comprise an
equal number of atoms as well as an equal number of calls to a particular
predicate. Furthermore, we say that a pair $(Q'_1, Q'_2)$ of similarly
structured subgoals of $Q_1$ and $Q_2$ is maximal iff there does not exist
another pair of similarly structured subgoals $(Q''_1, Q''_2)$ of $Q_1$ and
$Q_2$ such that $Q'_1 \subset Q''_1$ and $Q'_2 \subset Q''_2$, where $\subset$
denotes the subgoal relation.

Example 3. Consider the goals $Q_1$ = p(a,f(A)), s(A), q(A,B) and $Q_2$ =
q(Y,Z), p(f(X),Y), r(Z,S). The pair $(Q'_1, Q'_2)$ with $Q'_1$ = p(a,f(A)), q(A,B)
and $Q'_2$ = q(Y,Z), p(f(X),Y) is a maximal pair of similarly structured subgoals
of $Q_1$ and $Q_2$.

Note that for any given pair of goals, there always exists at least one maximal 
pair of similarly structured subgoals. If the goals do not contain a call to the same 
predicate, the (unique) maximal pair of similarly structured goals is (□, □) with 
□ denoting the empty goal. Moreover, since goals are considered as multisets 
a maximal pair of similarly structured subgoals is always unique. Observe that 
atoms not being part of the maximal similarly structured subgoals can never be 
part of a generalization of both goals. Hence our interest in maximal similarly 
structured subgoals. 
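To make the construction concrete, the following Python sketch extracts a pair of similarly structured subgoals by intersecting the multisets of predicate symbols of the two goals. Goals are encoded as lists of atoms and atoms as tuples (name, arg1, ..., argn); this encoding and the function names are assumptions of the sketch. When a predicate is called a different number of times in the two goals, the sketch simply keeps the first occurrences, which is one possible choice and not something prescribed by the definitions above.

```python
from collections import Counter

def pred(atom):
    """Predicate symbol (name, arity) of an atom encoded as (name, arg1, ..., argn)."""
    return (atom[0], len(atom) - 1)

def maximal_pair(q1, q2):
    """Return subgoals of q1 and q2 whose multisets of predicate symbols coincide."""
    # Multiset intersection: each predicate is kept min(count in q1, count in q2) times.
    common = Counter(map(pred, q1)) & Counter(map(pred, q2))

    def pick(goal):
        budget = Counter(common)
        kept = []
        for atom in goal:
            if budget[pred(atom)] > 0:       # assumption: keep first occurrences
                budget[pred(atom)] -= 1
                kept.append(atom)
        return kept

    return pick(q1), pick(q2)

# Goals of Example 3 (variables are plain strings, the constant a is ("a",)).
Q1 = [("p", ("a",), ("f", "A")), ("s", "A"), ("q", "A", "B")]
Q2 = [("q", "Y", "Z"), ("p", ("f", "X"), "Y"), ("r", "Z", "S")]

sub1, sub2 = maximal_pair(Q1, Q2)
print(sub1)
print(sub2)
```

For goals that share no predicate symbol, both returned lists are empty, which corresponds to the pair of empty goals mentioned above.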

When comparing similarly structured subgoals, we would like to abstract
from actual variable names, while retaining sharing information between goals.
In other words, we would like our measure to return a higher value in the case
where one of the goals being compared is a renaming of the other (i.e. it
presents the same dataflow). Take for example the goals

G   = p(X,Y), q(Y,Z)
G'  = p(A,B), q(B,C)
G'' = p(A,B), q(C,D)

Since $G'$ is a renaming of $G$, their msg would be a goal identical to either
of them (up to renaming). By contrast, the msg of $G$ and $G''$ is identical (up
to renaming) to only $G''$. To reflect this observation when measuring the
similarity of two goals $Q_1$ and $Q_2$, we define a set of variable renamings
between the two goals as follows. For convenience, we assume that
$vars(Q_1) \cap vars(Q_2) = \emptyset$ and that $\#vars(Q_1) \leq \#vars(Q_2)$,
where $\#S$ denotes the cardinality of the set $S$.

Definition 3. Let $(Q_1, Q_2)$ be a pair of similarly structured (sub)goals with
$\#vars(Q_1) \leq \#vars(Q_2)$. We define $R(Q_1, Q_2)$ as the (finite) set of
injective mappings from $vars(Q_1)$ into $vars(Q_2)$.

Example 4. Let us take $Q'_1$ and $Q'_2$ as in Example 3 above. Then we have
$vars(Q'_1) = \{A, B\}$, $vars(Q'_2) = \{X, Y, Z\}$ and consequently the set
$R(Q'_1, Q'_2)$ comprises the following injective mappings:

$\{A \mapsto X, B \mapsto Y\}$, $\{A \mapsto X, B \mapsto Z\}$, $\{A \mapsto Y, B \mapsto X\}$,
$\{A \mapsto Y, B \mapsto Z\}$, $\{A \mapsto Z, B \mapsto X\}$, $\{A \mapsto Z, B \mapsto Y\}$

We can now define the commonality between a pair of similarly structured
subgoals $Q_1$ and $Q_2$ as the maximal strict commonality one could obtain by
changing the order of the atoms and renaming (we call a permutation of a goal
$G$ any goal obtained by reordering the atoms of $G$):

Definition 4. Let $Q_1$ and $Q_2$ be two similarly structured (sub)goals such
that $\#vars(Q_1) \leq \#vars(Q_2)$. The commonality between $Q_1$ and $Q_2$,
which we denote by $C(Q_1, Q_2)$, is defined as:

$C(Q_1, Q_2) = \max\{c(Q_1\rho, Q'_2) \mid Q'_2 \text{ is a permutation of } Q_2 \text{ and } \rho \in R(Q_1, Q_2)\}$.

Similarly, if $Q_1$ and $Q_2$ are such that $\#vars(Q_1) > \#vars(Q_2)$, then
$C(Q_1, Q_2)$ is defined as $C(Q_2, Q_1)$.

We are now ready to define the similarity between two arbitrary goals, which
we define as the commonality between the goals' maximal pair of similarly
structured subgoals:

Definition 5. Let $Q_1$ and $Q_2$ be two goals such that
$\#vars(Q_1) \leq \#vars(Q_2)$. The similarity between $Q_1$ and $Q_2$, denoted
$\sigma(Q_1, Q_2)$, is defined as $\sigma(Q_1, Q_2) = C(Q'_1, Q'_2)$, where
$(Q'_1, Q'_2)$ is the maximal pair of similarly structured subgoals of $Q_1$
and $Q_2$. If $Q_1$ and $Q_2$ are such that $\#vars(Q_1) > \#vars(Q_2)$, then
$\sigma(Q_1, Q_2)$ is defined as $\sigma(Q_2, Q_1)$.



Example 5. Let $Q_1$, $Q_2$ and $Q'_1$, $Q'_2$ be as in Example 3. For the
similarity between $Q_1$ and $Q_2$, we have:

$\sigma(Q_1, Q_2) = C(Q'_1, Q'_2) = \max\{c(Q'_1\rho, Q''_2) \mid Q''_2 \text{ is a permutation of } Q'_2 \text{ and } \rho \in R(Q'_1, Q'_2)\}$.

One can easily see that the maximal value is 5 and it is obtained for
$\rho = \{A \mapsto Y, B \mapsto Z\}$ and $Q''_2$ = p(f(X),Y), q(Y,Z).
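The maximum in Example 5 can be reproduced by exhaustive search. The Python sketch below is a brute-force reading of the commonality of two similarly structured goals, not an efficient algorithm and not necessarily how an actual tool would proceed: it enumerates every injective renaming and every atom reordering. The term encoding (variables as strings, other terms as tuples headed by their functor, goals as lists of atoms), the assumed base cases of the strict commonality c, and all function names are assumptions of the sketch. As in the text, the two goals are assumed to have disjoint variables.

```python
from itertools import permutations

def is_var(term):
    return isinstance(term, str)

def c(t1, t2):
    """Strict commonality (assumed base cases: equal variables 1, mismatches 0)."""
    if is_var(t1) or is_var(t2):
        return 1 if t1 == t2 else 0
    if t1[0] != t2[0] or len(t1) != len(t2):
        return 0
    return 1 + sum(c(a, b) for a, b in zip(t1[1:], t2[1:]))

def conj(atoms):
    """Right-nested conjunction of a non-empty list of atoms."""
    if len(atoms) == 1:
        return atoms[0]
    return ("and", atoms[0], conj(atoms[1:]))

def goal_vars(goal):
    """Variables of a goal, in order of first occurrence."""
    seen = []
    def walk(term):
        if is_var(term):
            if term not in seen:
                seen.append(term)
        else:
            for arg in term[1:]:
                walk(arg)
    for atom in goal:
        walk(atom)
    return seen

def rename(term, rho):
    """Apply the variable mapping rho to a term."""
    if is_var(term):
        return rho.get(term, term)
    return (term[0],) + tuple(rename(arg, rho) for arg in term[1:])

def commonality(q1, q2):
    """Brute-force commonality of two similarly structured goals."""
    if not q1 or not q2:
        return 0                                   # empty goals share nothing
    v1, v2 = goal_vars(q1), goal_vars(q2)
    if len(v1) > len(v2):
        return commonality(q2, q1)                 # rename the goal with fewer variables
    best = 0
    for image in permutations(v2, len(v1)):        # all injective renamings
        rho = dict(zip(v1, image))
        renamed = conj([rename(atom, rho) for atom in q1])
        for reordered in permutations(q2):         # all atom reorderings
            best = max(best, c(renamed, conj(list(reordered))))
    return best

# Maximal pair of similarly structured subgoals from Example 3.
Q1p = [("p", ("a",), ("f", "A")), ("q", "A", "B")]
Q2p = [("q", "Y", "Z"), ("p", ("f", "X"), "Y")]

print(commonality(Q1p, Q2p))
```

The search is exponential in the number of atoms and variables, which is acceptable for goals of the size used in the examples but illustrates why a cheaper approximation is attractive.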

From the above example, it can easily be seen that our notion of similarity 
between arbitrary goals reflects indeed the maximal amount of structure that one 
could preserve by taking the most specific generalization of the goals' maximal 
pair of similarly structured subgoals after renaming and reordering the atoms. 

Corollary 1. Let $Q_1$ and $Q_2$ be arbitrary goals and let $(Q'_1, Q'_2)$ be
their maximal pair of similarly structured subgoals. Let $Q''_2$ be a
permutation of $Q'_2$ and $\rho$ a renaming as in Definition 4 such that
$\sigma(Q_1, Q_2) = c(Q'_1\rho, Q''_2)$. Then,

$\sigma(Q_1, Q_2) = c(Q'_1\rho, Q''_2) = nodes(msg(Q'_1\rho, Q''_2)) + \delta$

where $\delta$ denotes the number of identical variables that occur at
identical positions in $Q'_1\rho$ and $Q''_2$.

In particular, if goals $Q_1$ and $Q_2$ are identical modulo atom reordering
and renaming, then $\sigma(Q_1, Q_2)$ equals the total number of nodes
(including the variables) in the term representation of either goal.

3.2 Comparing predicate definitions 

We will now extend the notion of similarity from individual goals to complete 
predicate definitions. When comparing predicate definitions, we will abstract 
from the order of the arguments in each definition. To that end, we define the
notion of an argument permutation as follows: 

Definition 6. Given two n-ary predicates p/n and q/n, an argument permutation
between p and q is a bijective mapping $\{1, \ldots, n\} \mapsto \{1, \ldots, n\}$.

Note that an argument permutation only exists between predicates having the 
same arity. In order to consider two predicate definitions as being similar, we 
impose the condition that both definitions have the same recursive structure. 
By this, we mean that there exists a one-to-one mapping between the clauses 
in both definitions such that 1) corresponding clauses have the same number 
of recursive calls and 2) the corresponding recursive calls are identical up to a 
renaming of the variables, a renaming of the recursive calls and a permutation of 
the argument positions. Since predicates can be mutually recursive, let us first 
formally state the notion of a clause mapping between two strongly connected 
components. 



Definition 7. Let $[p/n]$ and $[p'/n]$ be two strongly connected components. A
clause mapping between $[p/n]$ and $[p'/n]$ is a bijective mapping $\varphi$
with $dom(\varphi) = Clauses([p/n])$ and $range(\varphi) = Clauses([p'/n])$
such that for any clauses $c_1, c_2 \in Clauses([p/n])$, we have
$pred(head(c_1)) = pred(head(c_2)) \Leftrightarrow pred(head(\varphi(c_1))) = pred(head(\varphi(c_2)))$.

A clause mapping establishes a 1-1 correspondence between the clauses of two
strongly connected components. Note that such a clause mapping implicitly
defines a bijective mapping between the predicates of the components. Slightly
abusing notation, if $\varphi$ is a clause mapping between $[p/n]$ and $[p'/n]$
and $q/m \in [p/n]$, we will use $\varphi(q)$ to denote the predicate in
$[p'/n]$ whose definition corresponds (by $\varphi$) to the definition of $q$.

Example 6. Consider the predicates append/3 and concat/3 from the introduction.
The mapping $\varphi$ that maps the i-th clause of append/3 onto the i-th
clause of concat/3 (for i = 1, 2) is a clause mapping between [append/3] and
[concat/3].

Given a clause mapping between two strongly connected components, we can 
formally state the conditions under which both components are considered to 
have the same recursive structure. 

Definition 8. Let $[p]$ and $[p']$ be two strongly connected components and
$\varphi$ a clause mapping between $[p]$ and $[p']$. We say that $[p]$ and
$[p']$ have the same recursive structure w.r.t. $\varphi$ if and only if the
following holds: 1) for any predicate $q \in [p]$ there exists an argument
permutation $\pi_q$ between $q$ and $\varphi(q)$ and 2) for any clause
$c \in Clauses([p])$ of the form

$A_0 \leftarrow Q_1, A_1, \ldots, Q_k, A_k, Q_{k+1}$

the corresponding clause $\varphi(c) \in Clauses([p'])$ is of the form

$A'_0 \leftarrow Q'_1, A'_1, \ldots, Q'_k, A'_k, Q'_{k+1}$

and there exists a variable renaming $\rho$ of $c$ such that for every $A_i$
(with $0 \leq i \leq k$) we have that if $A_i = q(t_1, \ldots, t_m)$ for some
predicate $q/m$, then $A'_i = \varphi(q)(t_{\pi_q(1)}, \ldots, t_{\pi_q(m)})\rho$.

The above definition implies that two predicates have the same recursive 
structure if there exists a clause mapping between them such that the corre- 
sponding clauses contain the same number of recursive calls and there exists an 
argument permutation that renders the corresponding calls identical (modulo 
a variable renaming). The same must hold for the heads of the corresponding 
clauses. When considering strongly connected components rather than individual 
predicates, the same must hold for each pair of corresponding predicates. 

Example 7. The append/3 and concat/3 predicates from the introduction have
the same recursive structure. Indeed, take the clause mapping $\varphi$ from
Example 6 and take for the argument permutation between append/3 and concat/3
the mapping $\pi = \{(1, 2), (2, 3), (3, 1)\}$ and for the renaming
$\rho = \{X/E, Xs/Es\}$.



Note that Definition 8 is not restricted to recursive predicates. Two
non-recursive predicates are characterized as having the same recursive structure
if there exists an argument permutation between both predicates that makes 
the heads of the corresponding clauses identical modulo renaming. Also note 
that, in principle at least, there might exist several clause mappings between 
two predicates (or SCCs) under which the predicates (or SCCs) have the same 
recursive structure. 

We are now ready to define the similarity between two strongly connected 
components. As said before, we only consider strongly connected components 
that have the same recursive structure. 

Definition 9. Let $[p]$ and $[p']$ be two strongly connected components that
have the same recursive structure w.r.t. a clause mapping $\varphi$. The
similarity between $[p]$ and $[p']$ w.r.t. $\varphi$, denoted by
$\sigma([p], [p'], \varphi)$, is defined as

$\sigma([p], [p'], \varphi) = \sum_{c \in Clauses([p])} \left( 1 + \sum_{i=1}^{k+1} \sigma(Q_i, Q'_i) + \sum_{i=0}^{k} c(A'_i, A''_i) \right)$

if $c$ and $\varphi(c)$ are clauses of the form

$A_0 \leftarrow Q_1, A_1, \ldots, Q_k, A_k, Q_{k+1}$

and

$A'_0 \leftarrow Q'_1, A'_1, \ldots, Q'_k, A'_k, Q'_{k+1}$

respectively and $A''_i$ is $q'(t_{\pi_q(1)}, \ldots, t_{\pi_q(m)})\rho$, if
$A_i = q(t_1, \ldots, t_m)$, $pred(A'_i) = q'$, and $\pi_q$ and $\rho$ refer to
the required argument permutation and renaming from Definition 8.

In other words, the similarity between two predicate definitions (or SCCs)
is defined as the sum of the similarities between the corresponding clauses; the
similarity between a pair of clauses comprises two main parts: 1) the sum of the
similarities between each pair of corresponding non-recursive subgoals, and 2)
the sum of the commonalities between the heads and the corresponding recursive
subgoals. Note that in order to compute the latter, we need to account for the
difference in predicate names and the possible permutation of the arguments
(hence the use of $A''_i$). Also note that, for each clause, we add 1 to reflect
the node represented by the :- functor in the clause's term representation. One
can show that a statement similar to Corollary 1 holds for clauses and strongly
connected components.

Example 8. Let us reconsider the append/3 and concat/3 predicates from the
introduction. Their definitions have the same recursive structure w.r.t.
$\varphi$ from Example 6 (to see this take $\pi_{app}$ and $\rho$ as in
Example 7). None of their clauses contain non-recursive subgoals, hence when
computing the similarity between both definitions, we sum, for each clause, the
commonalities between the heads and recursive calls. We have

c(concat(L,[],L), concat(L,[],L)) = 4
c(concat([E|Zs],[E|Es],Y), concat([E|Zs],[E|Es],Y)) = 8
c(concat(Zs,Es,Y), concat(Zs,Es,Y)) = 4

Hence, we obtain $\sigma([append], [concat], \varphi) = (1 + 4) + (1 + 8 + 4) = 18$.



Example 9. Consider the rev_all and add1_and_sqr predicates from the
introduction. One can easily verify that both predicates have the same
recursive structure; the required clause mapping, argument permutation and
renaming are all the identity mapping. With respect to the similarity between
the two definitions, it is clear that the corresponding non-recursive subgoals
have no similarly structured subgoals, hence we have

c(add1_and_sqr([],[]), add1_and_sqr([],[])) = 3
c(add1_and_sqr([X|Xs],[Y|Ys]), add1_and_sqr([X|Xs],[Y|Ys])) = 7
c(add1_and_sqr(Xs,Ys), add1_and_sqr(Xs,Ys)) = 3
$\sigma$(reverse(X,Y), (N is X+1, Y is N*N)) = 0

Hence, we obtain $\sigma([rev\_all], [add1\_and\_sqr], \varphi) = (1 + 3) + (1 + 7 + 3 + 0) = 15$.

Intuitively, it is clear that the notion of similarity between predicate defi- 
nitions represents the number of nodes that are common to both term repre- 
sentations of the involved predicates. Our notion is quite liberal in the sense 
that it allows for: 1) renaming of the involved predicate and variable names, 2) 
permutation of the arguments, and 3) permutation of the body atoms within 
each non-recursive subgoal. Moreover, by relating the similarity between two 
predicate definitions to the total number of nodes that are effectively present 
in each of the definitions' term representations, we obtain an indication of how 
close each definition is to some most specific generalization of both definitions. 

Definition 10. Let $[p]$ and $[p']$ be two strongly connected components that
have the same recursive structure with respect to some clause mapping
$\varphi$. The closeness between $[p]$ and $[p']$ is defined as the pair

$\left( \frac{m}{N_{[p]}}, \frac{m}{N_{[p']}} \right)$

where $m = \sigma([p], [p'], \varphi)$ and $N_{[p]}$ (or $N_{[p']}$) represents
the total number of nodes in the term representations of the predicates in
$[p]$ (or $[p']$).

Note that the closeness as defined by the definition above is a pair of values
between 0 and 1. Also note that it is (1, 1) in case the predicates under
consideration are duplicates.

Example 10. One can easily verify that the (total) number of nodes in both the
term representations of the append and concat definitions is 18. Therefore, from
Example 8 it follows that the closeness between them is (1, 1), indicating they
are duplicates. The number of nodes in rev_all and add1_and_sqr is,
respectively, 19 and 25. By Example 9 it follows that the closeness between
them is (0.79, 0.6). These numbers indicate how close each of these definitions
is to the code structure that is common to both of them, which we could
represent by the following definition:




mp(A,B) :- A = [], B = [].
mp(A,B) :- A = [X|Xs], B = [Y|Ys], mp(Xs,Ys).



Generalizing the examples given above, we conjecture that, under certain 
conditions, the closeness between predicates is a useful indication on how much 
duplicated code is contained in their definitions. 
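The arithmetic behind the closeness values can be spelled out in a few lines of Python. The sketch assumes that closeness is the pair obtained by dividing the similarity m by the node count of each definition, which is the reading that matches the figures quoted in Example 10; the function and variable names are ours.

```python
def closeness(m, n_first, n_second):
    """Pair relating the shared node count m to the size of each definition."""
    return (m / n_first, m / n_second)

# Similarities and node counts quoted in Examples 8, 9 and 10.
append_concat = closeness(18, 18, 18)
rev_add1 = closeness(15, 19, 25)

print(append_concat)
print(tuple(round(value, 2) for value in rev_add1))
```

A pair with both components equal to 1 signals duplicates, while an asymmetric pair such as the second one indicates that the shared structure covers a larger share of the smaller definition.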

3.3 Discussion 

In the preceding sections, we have defined the notions that allow us to
characterize the similarity between predicate definitions. A necessary
condition for predicates to be considered similar is that they have the same
recursive structure. Definition 8 requires that for each pair of corresponding
clauses, the corresponding recursive calls (or heads) contain the same terms as
arguments (modulo an argument permutation and variable renaming). While this
might seem overly restrictive, a possible remedy is to compute similarities on
programs in a normal form where each atom is of the form
$p(X_1, \ldots, X_n)$, $X = Y$ or $X = f(X_1, \ldots, X_n)$ (with
$X, Y, X_1, \ldots, X_n$ different variables). Let us reconsider the append and
concat definitions, this time in normal form:

append(X,Y,Z) :- X = [], Z = Y.
append(X,Y,Z) :- X = [Xe|Xs], Z = [Xe|Zs], append(Xs,Y,Zs).

concat(A,B,C) :- B = [], A = C.
concat(A,B,C) :- A = [Be|As], B = [Be|Bs], concat(As,Bs,C).

Note that these definitions still have the same recursive structure. Also note 
that although the computed similarity values will somewhat change due to the 
presence of the extra body atoms, the similarities will remain identical for both 
predicates, and thus the closeness between them will still be (1, 1), indicating 
they are duplicates. Changing the order of the unifications in one of the defini- 
tions does not influence the computed numbers as these are independent of the 
order of the body atoms in the non-recursive subgoals. However, note that our
definitions only capture permutations of body atoms that are confined within
a single non-recursive subgoal. Take for example the definition of append from
above, where we move the unification Z = [Xe|Zs] over the recursive call:

append(X,Y,Z) :- X = [], Z = Y.
append(X,Y,Z) :- X = [Xe|Xs], append(Xs,Y,Zs), Z = [Xe|Zs].

By Definition 8, this version of append still exhibits the same recursive
structure as the concat predicate above. Nevertheless, the similarity between
the definitions will be significantly lower, since the corresponding
non-recursive subgoals contain a different number of unifications.

We believe that restricting the computation of similarity to corresponding 
non-recursive subgoals does not impose a real limitation. In fact, moving a com- 
putation over a recursive call usually represents a significant change in program 
(and computation) structure that goes beyond the changes in program structure 
that we would like our technique to be able to detect. 

The computation of similarities lends itself to a top-down calculation. Indeed,
one can first compute which predicates have the same recursive structure. Next,
for each pair of predicates having the same recursive structure, one can
compute the similarities between each pair of corresponding non-recursive
subgoals. The complexity of such an algorithm is quadratic in the number of
predicates, since every pair of predicates may have to be examined.
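The two-phase computation described above can be sketched as follows, in Python rather than in a logic language. This is only an illustration under stated assumptions: `same_structure` and `similarity` are hypothetical callables standing in for the recursive-structure test and the similarity computation defined earlier, and the toy summaries (number of recursive calls per clause) are a crude stand-in invented for the example.

```python
from itertools import combinations

def compare_all(predicates, same_structure, similarity):
    """Top-down comparison: filter pairs on recursive structure first,
    then compute similarities only for the pairs that pass the filter.

    `same_structure` and `similarity` are hypothetical callables supplied
    by the caller; this sketch does not define them.
    """
    results = {}
    examined = 0
    for p, q in combinations(sorted(predicates), 2):   # n*(n-1)/2 pairs
        examined += 1
        if same_structure(predicates[p], predicates[q]):
            results[(p, q)] = similarity(predicates[p], predicates[q])
    return results, examined

# Toy data (assumption): each predicate is summarised by the number of
# recursive calls occurring in each of its clauses.
toy = {
    "append": (0, 1),
    "concat": (0, 1),
    "rev_all": (0, 1),
    "fib": (0, 0, 2),
}
results, examined = compare_all(
    toy,
    same_structure=lambda a, b: sorted(a) == sorted(b),
    similarity=lambda a, b: 1.0,   # placeholder value, not a real similarity
)
print(examined)
print(sorted(results))
```

The counter `examined` makes the quadratic behaviour visible: every unordered pair of predicates is visited once, whether or not it survives the structure filter.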

In the following section, we present a more efficient technique approximating
the computation of similarities. The idea is to compute, for each predicate
definition in isolation, a so-called fingerprint. Such a fingerprint captures
in a single value those characteristics of a predicate that are significant
when searching for duplicated (or common) functionality, while it abstracts
from those characteristics that are less relevant during the search. The
computation of these fingerprints does not require any comparison between the
definitions of different predicates. Comparing fingerprints is considerably
easier than comparing predicate definitions, and we believe that the result
provides a useful indication of which predicates are possible candidates for a
more thorough comparison [21].

4 Fingerprinting logic programs 

In what follows, we will map a predicate to a so-called fingerprint. The
fingerprint of a predicate is a value that is constructed in such a way that
(1) it reflects the recursive structure of the predicate, (2) predicates that
are duplicates are mapped onto the same value, and (3) the more similar two
predicates are, the closer the values of their fingerprints. Clearly,
fingerprints can be seen as abstractions (cf. [6,7,8]). In what follows, we
consider programs in normal form as defined above; that is, every atom in the
program is of the form $p(X_1, \ldots, X_n)$, $X = Y$ or
$X = f(Y_1, \ldots, Y_n)$. We proceed in a stepwise fashion and define domains
of fingerprints over goals, clauses and predicates. For each category we define
an order relation over the introduced domain.

The basic idea behind our fingerprinting technique is to abstract a goal by 
counting the number of occurrences of each function and predicate symbol. 

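Definition 11. Let $\mathcal{A}$ be an alphabet, and $\mathcal{F}_\mathcal{A}$,
$\Pi_\mathcal{A}$ and $\mathcal{Q}_\mathcal{A}$, respectively, the
corresponding sets of function symbols, predicate symbols and normalized goals.
The goalprint function $\varphi_g$ associates every goal
$Q \in \mathcal{Q}_\mathcal{A}$ with a total function
$\varphi_g(Q) : (\mathcal{F}_\mathcal{A} \cup \Pi_\mathcal{A} \cup \{=\}) \to \mathbb{N}$,
called the goalprint of $Q$, such that:

- $\varphi_g(p(Y_1, \ldots, Y_n))(h) = 1$ if $h$ is $p$, and 0 otherwise;

- $\varphi_g(X = f(Y_1, \ldots, Y_n))(h) = 1$ if $h$ is $f$ or $h$ is $=$, and 0 otherwise;

- $\varphi_g(Q_1 \wedge Q_2)(h) = \varphi_g(Q_1)(h) + \varphi_g(Q_2)(h)$ for all $h \in \mathcal{F}_\mathcal{A} \cup \Pi_\mathcal{A} \cup \{=\}$.

The set of all goalprints over $\mathcal{A}$, i.e., $\varphi_g(\mathcal{Q}_\mathcal{A})$, is denoted $GP_\mathcal{A}$.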

Computing the goalprint associated with a goal is straightforward given that
the predicate definitions are in normal form. Observe that by computing a
goalprint, we ignore the order of the atoms in the goal and the sharing between
them. For a given alphabet, all goalprints range over the same domain. Hence,
we define the following pointwise order on $GP_\mathcal{A}$, which applies to
any two goalprints. Let $\varphi_1, \varphi_2 \in GP_\mathcal{A}$; we say
$\varphi_1 \leq \varphi_2$ if and only if for all
$f \in dom(\varphi_1) = dom(\varphi_2)$ we have $\varphi_1(f) \leq \varphi_2(f)$.



Example 11. Consider the goal X = [A|As], As = [B|Bs], p(A, B, C). A goalprint
$\varphi$ for this goal would be $\varphi = \{([|],2), ((=),2), (p,1)\}$.$^{2}$
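As an illustration, the goalprint function and the order on goalprints can be sketched in Python. The encoding is an assumption of this sketch, not part of the definitions: a normalized goal is a list of pairs `("call", p)` for an atom p(...) and `("unify", f)` for X = f(...), with `("unify", None)` standing for X = Y. Counting a single occurrence of = for X = Y is likewise an assumption, chosen to agree with the clauseprints given for append and concat.

```python
from collections import Counter

def goalprint(goal):
    """Count the occurrences of each symbol in a normalized goal.

    `goal` is a list of (kind, symbol) pairs (hypothetical encoding):
    ("call", p) for p(Y1,...,Yn), ("unify", f) for X = f(Y1,...,Yn),
    and ("unify", None) for X = Y.
    """
    counts = Counter()
    for kind, symbol in goal:
        if kind == "call":
            counts[symbol] += 1
        elif kind == "unify":
            counts["="] += 1
            if symbol is not None:
                counts[symbol] += 1
        else:
            raise ValueError(f"unknown atom kind: {kind!r}")
    return counts

def leq(gp1, gp2):
    """Pointwise order: gp1 <= gp2 iff no symbol occurs more often in gp1."""
    return all(gp1[s] <= gp2[s] for s in set(gp1) | set(gp2))

# The goal of Example 11: X = [A|As], As = [B|Bs], p(A, B, C)
example_11 = [("unify", "[|]"), ("unify", "[|]"), ("call", "p")]
print(goalprint(example_11))
```

Symbols that do not occur are simply absent from the `Counter`, which returns 0 for them, so the alphabet can stay implicit.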



Let us conjecture the following result, relating the greatest lower bound of
two goalprints to the generalization (and thus similarity) of the concerned
goals.

Conjecture 1. Let $Q_1$ and $Q_2$ be arbitrary goals and let $(Q_1', Q_2')$ be
their maximal pair of similarly structured subgoals. Let $Q_2''$ be a
permutation of $Q_2'$ and $\rho$ a renaming as in Definition 3 such that
$\sigma(Q_1, Q_2) = c(Q_1'\rho, Q_2'')$. Then we have

$\varphi_g(Q_1) \sqcap \varphi_g(Q_2) = \varphi_g(msg(Q_1', Q_2''))$.

Intuitively, this holds because the greatest lower bound
$\varphi_g(Q_1) \sqcap \varphi_g(Q_2)$ indicates precisely the number of
occurrences of each predicate and function symbol shared by $Q_1$ and $Q_2$
(and which are hence part of their most specific generalization). As a special
case, note that if $nodes(msg(Q_1', Q_2'')) = nodes(Q_1) = nodes(Q_2)$, then
$\varphi_g(Q_1) = \varphi_g(Q_2)$. In other words, duplicated goals will have
identical goalprints. Note that the converse does not necessarily hold:

Example 12. The goals $Q_1$: X = f(Y), Y = f(Z) and $Q_2$: A = f(B), C = f(B)
have identical goalprints, yet $\sigma(Q_1, Q_2)$ equals neither $nodes(Q_1)$
nor $nodes(Q_2)$. Indeed, $\sigma(Q_1, Q_2) = 7$ whereas
$nodes(Q_1) = nodes(Q_2) = 9$.
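The loss of information illustrated by Example 12 can be made concrete with a small Python sketch. The encoding of a unification X = f(Y1,...,Yn) as a triple `(lhs, functor, args)` and the helper `sharing` are assumptions introduced only for this illustration; the greatest lower bound is taken to be the pointwise minimum of the symbol counts.

```python
from collections import Counter

def goalprint(goal):
    """Goalprint of a goal made of unifications (lhs, functor, args)."""
    counts = Counter()
    for _lhs, functor, _args in goal:
        counts["="] += 1
        counts[functor] += 1
    return counts

def glb(gp1, gp2):
    """Greatest lower bound of two goalprints: pointwise minimum."""
    return gp1 & gp2   # Counter intersection keeps min(count1, count2)

def sharing(goal):
    """Hypothetical helper: for every variable, the kinds of positions
    (left-hand side or argument) at which it occurs, names abstracted away."""
    positions = {}
    for lhs, _functor, args in goal:
        positions.setdefault(lhs, []).append("lhs")
        for arg in args:
            positions.setdefault(arg, []).append("arg")
    return sorted(tuple(sorted(p)) for p in positions.values())

q1 = [("X", "f", ["Y"]), ("Y", "f", ["Z"])]   # X = f(Y), Y = f(Z)
q2 = [("A", "f", ["B"]), ("C", "f", ["B"])]   # A = f(B), C = f(B)

print(goalprint(q1) == goalprint(q2))   # the goalprints coincide
print(sharing(q1) == sharing(q2))       # the variable sharing does not
```

Both goals contain two occurrences of f and two of =, so their goalprints and the greatest lower bound of the two are all the same, while the way variables are shared between the atoms differs.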

We will now use the notion of goalprint to construct fingerprints of clauses 
and predicates. When abstracting a single clause of the form 

$A_0 \leftarrow Q_1, A_1, \ldots, Q_k, A_k, Q_{k+1}$

we keep track of the individual abstractions of the non-recursive subgoals
$Q_i$. Therefore, we define the fingerprint of a clause as a sequence of
goalprints: one for every (maximal) non-recursive subgoal of the clause body.

Definition 12. Let $\mathcal{A}$ be an alphabet and $P_\mathcal{A}$ be a
program over the alphabet. A clauseprint function $\varphi_c$ maps every clause
$A_0 \leftarrow Q_1, A_1, \ldots, Q_k, A_k, Q_{k+1}$ to a sequence of
goalprints $\langle \varphi_g(Q_1), \ldots, \varphi_g(Q_{k+1}) \rangle$, called
a clauseprint. The set of all such clauseprints, i.e.,
$\varphi_c(P_\mathcal{A})$, is denoted by $CP_\mathcal{A}$. That is,
$CP_\mathcal{A} \subseteq GP_\mathcal{A}^{*}$.

Since our primary interest is in comparing clauses having the same recursive
structure, we define the following partial order on clauseprints. Let
$\varphi^1, \varphi^2 \in CP_\mathcal{A}$; we define $\varphi^1 \leq \varphi^2$
if and only if $\varphi^1 = \langle \varphi_1, \ldots, \varphi_n \rangle$ and
$\varphi^2 = \langle \varphi'_1, \ldots, \varphi'_n \rangle$ for some
$n \in \mathbb{N}$ and $\varphi_i \leq \varphi'_i$ for all $1 \leq i \leq n$.

Example 13. Reconsider the definitions of append and concat in normal form.
The first clause of both predicates can be characterized by the clauseprint
$\langle \varphi_1 \rangle$ with $\varphi_1 = \{([],1), ((=),2)\}$; the second
clauses by $\langle \varphi_{2,1}, \varphi_{2,2} \rangle$ with
$\varphi_{2,1} = \{([|],2), ((=),2)\}$ and $\varphi_{2,2} = \{\}$.
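The construction of a clauseprint can be sketched in Python as follows. This is a minimal illustration under assumptions that are not part of the definitions: a clause body is a list of pairs `("call", p)` or `("unify", f)`, with `("unify", None)` for X = Y, and the set `recursive` names the predicates whose calls are treated as recursive.

```python
from collections import Counter

def clauseprint(body, recursive):
    """Sequence of goalprints, one per maximal non-recursive subgoal.

    `body` is a list of (kind, symbol) atoms and `recursive` a set of
    predicate names (both hypothetical encodings used by this sketch).
    """
    prints = [Counter()]
    for kind, symbol in body:
        if kind == "call" and symbol in recursive:
            prints.append(Counter())   # a recursive call closes the current subgoal
        elif kind == "call":
            prints[-1][symbol] += 1
        else:                          # "unify": X = f(...), or X = Y when symbol is None
            prints[-1]["="] += 1
            if symbol is not None:
                prints[-1][symbol] += 1
    return tuple(prints)

# append(X,Y,Z) :- X = [], Z = Y.
append_1 = [("unify", "[]"), ("unify", None)]
# append(X,Y,Z) :- X = [Xe|Xs], Z = [Xe|Zs], append(Xs,Y,Zs).
append_2 = [("unify", "[|]"), ("unify", "[|]"), ("call", "append")]
# append(X,Y,Z) :- X = [Xe|Xs], append(Xs,Y,Zs), Z = [Xe|Zs].
append_2_moved = [("unify", "[|]"), ("call", "append"), ("unify", "[|]")]

for body in (append_1, append_2, append_2_moved):
    print(clauseprint(body, {"append"}))
```

A body with k recursive calls always yields k+1 goalprints, some of which may be empty. Moving a unification over the recursive call changes the clauseprint, in line with the earlier observation that only permutations within a single non-recursive subgoal are captured.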



2 We leave the alphabet implicit and assume that a goalprint associates 0 to
every function or predicate symbol that is not explicitly mentioned.



Finally, the fingerprint of a predicate is defined as a function associating a
clauseprint to each of the clauses in the predicate's definition.

Definition 13. Let $\mathcal{A}$ be an alphabet, let $P_\mathcal{A}$ be a
program over the alphabet and $\Pi_P$ be the set of predicates in the program.
The predicate print function $\Phi$ maps every predicate $p \in \Pi_P$ to a
multiset $\{\varphi_c(c) \mid c \in Clauses(p)\}$, called a predicate print.
The set of all predicate prints is denoted $PP_\mathcal{A}$.

Observe that we use multisets rather than sets since different clauses of the 
same predicate can give rise to identical clauseprints. In this case we would like 
the clauseprint to appear twice in the predicate print. 

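Example 14. The predicate prints of append and concat from before, denoted
$\Phi_{app}$ and $\Phi_{conc}$, are defined as

$\Phi_{app} = \Phi_{conc} = \{\langle \varphi_1 \rangle, \langle \varphi_{2,1}, \varphi_{2,2} \rangle\}$,

with $\varphi_1$, $\varphi_{2,1}$ and $\varphi_{2,2}$ as in Example 13.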
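Since duplicated predicates receive identical predicate prints, candidates for duplication can be found by grouping predicates on their prints. The following Python sketch illustrates this idea under several assumptions: the `(kind, symbol)` encoding of atoms is hypothetical, only direct recursion is handled, and prints are frozen into hashable values so that they can serve as dictionary keys.

```python
from collections import Counter, defaultdict

def clauseprint(body, recursive):
    """Hashable clauseprint: one sorted symbol-count tuple per non-recursive subgoal."""
    prints = [Counter()]
    for kind, symbol in body:
        if kind == "call" and symbol in recursive:
            prints.append(Counter())   # recursive call: start the next subgoal
        elif kind == "call":
            prints[-1][symbol] += 1
        else:                          # "unify": X = f(...), or X = Y when symbol is None
            prints[-1]["="] += 1
            if symbol is not None:
                prints[-1][symbol] += 1
    return tuple(tuple(sorted(gp.items())) for gp in prints)

def predicate_print(clauses, recursive):
    """Multiset of clauseprints, frozen so that it can be used as a key."""
    return frozenset(Counter(clauseprint(b, recursive) for b in clauses).items())

def duplicate_candidates(program):
    """Group the names of predicates whose predicate prints coincide."""
    buckets = defaultdict(list)
    for name, clauses in program.items():
        # Assumption: only direct recursion, so the recursive set is {name}.
        buckets[predicate_print(clauses, {name})].append(name)
    return sorted(sorted(names) for names in buckets.values() if len(names) > 1)

program = {
    "append": [[("unify", "[]"), ("unify", None)],
               [("unify", "[|]"), ("unify", "[|]"), ("call", "append")]],
    "concat": [[("unify", "[]"), ("unify", None)],
               [("unify", "[|]"), ("unify", "[|]"), ("call", "concat")]],
    "rev_all": [[("unify", "[]"), ("unify", "[]")],
                [("unify", "[|]"), ("unify", "[|]"),
                 ("call", "reverse"), ("call", "rev_all")]],
}
print(duplicate_candidates(program))
```

Each predicate is fingerprinted in isolation and then placed in a bucket, so no pairwise comparison of definitions is needed to find predicates with identical prints.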

We define the following partial order on $PP_\mathcal{A}$. Let
$\Phi_1, \Phi_2 \in PP_\mathcal{A}$; we define $\Phi_1 \leq \Phi_2$ if and only
if $\Phi_1(c) \leq \Phi_2(c)$ for all $c \in \Pi_P$. Observe once again that
the order relation $\leq$ is only defined between fingerprints of predicates
having the same recursive structure. As a final example, let us reconsider the
predicates rev_all and add1_and_sqr in normal form:

Example 15. 

rev_all(A,B) :- A = [], B = [].
rev_all(A,B) :- A = [X|Xs], B = [Y|Ys], reverse(X,Y), rev_all(Xs,Ys).
add1_and_sqr(A,B) :- A = [], B = [].
add1_and_sqr(A,B) :- A = [X|Xs], B = [Y|Ys], N is X + 1, Y is N*N,
                     add1_and_sqr(Xs,Ys).

The associated predicate prints are $\Phi_{ra}$ and $\Phi_{aas}$, defined as:

$\Phi_{ra} = \{\langle \{([],2), ((=),2)\} \rangle, \langle \{([|],2), ((=),2), (reverse,1)\}, \{\} \rangle\}$

$\Phi_{aas} = \{\langle \{([],2), ((=),2)\} \rangle, \langle \{([|],2), ((=),2), (is,2), (+,1), (*,1)\}, \{\} \rangle\}$

Both predicate prints stem from predicates having the same recursive structure,
so their greatest lower bound is defined. Computing it, i.e.
$\Phi = \Phi_{ra} \sqcap \Phi_{aas}$, gives us the predicate print
$\Phi = \{\langle \{([],2), ((=),2)\} \rangle, \langle \{([|],2), ((=),2)\}, \{\} \rangle\}$,
which corresponds indeed to the fingerprint of the mp/2 predicate from
Example 10, reflecting the common code structure of the rev_all and
add1_and_sqr predicates.
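The greatest lower bound of the two predicate prints of Example 15 can be recomputed with a short Python sketch. It rests on assumptions made only for this illustration: each predicate print is given as a list of clauseprints already aligned by a fixed clause mapping (here, clause order), and the greatest lower bound of two goalprints is the pointwise minimum of their symbol counts.

```python
from collections import Counter

def glb_clauseprints(cp1, cp2):
    """Componentwise greatest lower bound of two clauseprints of equal length."""
    if len(cp1) != len(cp2):
        raise ValueError("clauseprints differ in recursive structure")
    return tuple(g1 & g2 for g1, g2 in zip(cp1, cp2))   # & is pointwise minimum

def glb_predicate_prints(pp1, pp2):
    """Greatest lower bound of two predicate prints.

    Assumption: pp1 and pp2 are lists of clauseprints aligned by a
    clause mapping chosen beforehand (here simply the clause order).
    """
    if len(pp1) != len(pp2):
        raise ValueError("predicate prints differ in number of clauses")
    return [glb_clauseprints(c1, c2) for c1, c2 in zip(pp1, pp2)]

base = (Counter({"[]": 2, "=": 2}),)
rev_all = [base,
           (Counter({"[|]": 2, "=": 2, "reverse": 1}), Counter())]
add1_and_sqr = [base,
                (Counter({"[|]": 2, "=": 2, "is": 2, "+": 1, "*": 1}), Counter())]

common = glb_predicate_prints(rev_all, add1_and_sqr)
for clause in common:
    print(clause)
```

The symbols reverse, is, + and * each occur in only one of the two predicates and therefore disappear from the result, leaving the list-traversal skeleton that both predicates share.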

Similarly, we define an SCC-print function $\Phi$ and an SCC-print for an SCC
$[p]$ as a multiset of predicate prints corresponding to all predicates in $[p]$.

Lemma 2. Let $[p]$ and $[p']$ be two strongly connected components that have
the same recursive structure with respect to some clause mapping $\phi$. If
$\gamma([p], [p']) = (1,1)$ then $\Phi([p]) = \Phi([p'])$.



5 Discussion and ongoing work 



Conjecture 1 relates the greatest lower bound of two goalprints to the most
specific generalization of the goals (after renaming and atom reordering) and
thus to their similarity. An interesting topic of future work is to extend
these results to complete predicate definitions. Doing so requires a formal
characterization of the most specific generalization of two predicates (as
always modulo renaming and atom reordering). As suggested by Example 10 and
the related examples above, we conjecture that the greatest lower bound of two
predicate prints neatly characterizes the similarity between the predicates and
thus their common code. Literal code duplication is reduced to a special case,
since duplicated predicates (having closeness (1,1)) have identical
fingerprints.

Other topics of future work include adapting the techniques proposed in this
paper to Prolog. This requires, among other things, constraining the notion of
clause mapping (to fix the order in which clauses must be mapped onto each
other) and limiting the amount of reordering permitted when computing the
similarity between non-recursive subgoals. Finally, we intend to investigate
the relation with fingerprinting techniques used for the detection of
plagiarism, such as [19], to build a prototype implementation of our proposed
technique, and to evaluate its effectiveness and performance on a testbed of
programs.